METHOD AND APPARATUS FOR PERFORMING LAYOUT-DRIVEN
OPTIMIZATIONS ON FIELD PROGRAMMABLE GATE ARRAYS 



FIELD OF THE INVENTION 
The present invention relates to the field of field programmable gate arrays (FPGAs).
More specifically, the present invention relates to a method and apparatus for performing
layout-driven optimizations on systems on FPGAs using tools such as electronic design
automation (EDA) tools.

BACKGROUND

FPGAs may be used to implement large systems that include millions of gates and
megabits of embedded memory. Of the tasks required in managing and optimizing a design,
placement of components on the FPGAs and routing connections between components on the
FPGA utilizing available resources can be the most challenging and time consuming. In order to
satisfy placement and timing specifications, several iterations are often required to determine how
components are to be placed on the target device and which routing resources to allocate to the
components. The complexity of large systems often requires the use of EDA tools to manage and
optimize their design onto physical target devices. Automated placement and routing algorithms
in EDA tools perform the time-consuming task of placement and routing of components onto
physical devices.

The design of a system is often impacted by the delays of connections routed along the
programmable interconnect of the target device. The interconnect provides the ability to
implement arbitrary connections; however, it includes both highly capacitive and resistive
elements. The delay experienced by a connection is affected by the number of routing elements
used to route the connection. Traditional approaches for reducing the delay were targeted at
improving the automated placement and routing algorithms in the EDA tools. Although some
reductions in delay were achieved with these approaches, the approaches were not able to
perform further improvements to the system after the placement and routing phases. It is often
only after the placement and routing phases of the FPGA computer automated design (CAD) flow
that connection delays are fully known.

Thus, what is needed is an efficient method and apparatus for performing layout-driven 
optimizations on FPGAs after the placement and routing phases of the FPGA CAD flow. 



SUMMARY 

According to an embodiment of the present invention, critical components of a system
used for processing a critical signal are identified and expanded after the placement and routing
of components in the system. Expansion includes making duplicate copies of the components
associated with the critical signal. The copies of the components are used to generate
pre-computed values in response to possible values of the critical signal. An appropriate
pre-computed value may be selected in response to the critical signal when it arrives. According to
an embodiment of the present invention, placement of the duplicate copies of the components is
attempted at preferred locations that are identified. If illegalities in placement exist, non-critical
components are shifted in order to satisfy the preferred locations and produce a legal placement.



BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the present invention are illustrated by way of example 
and are by no means intended to limit the scope of the present invention to the particular 
embodiments shown, and in which: 

Figure 1 is a flow chart illustrating a method for designing a system according to an 
embodiment of the present invention; 

Figure 2 illustrates a target device utilizing FPGAs according to an embodiment of the
present invention;

Figure 3 illustrates a LAB according to an embodiment of the present invention; 

Figure 4 is a flow chart illustrating a method for performing layout-driven optimization 
according to an embodiment of the present invention; 

Figure 5 illustrates critical path traceback performed on exemplary vertices and edges 
according to an embodiment of the present invention; 

Figure 6 illustrates the transitive fanouts associated with a critical signal according to an 
embodiment of the present invention; 

Figure 7 illustrates the duplication of critical components according to an embodiment of 
the present invention; 

Figure 8 illustrates logic levels relative to signal x for the purpose of controlling vertex 
duplication according to an embodiment of the present invention; 

Figure 9 illustrates an example of unused vertices according to an embodiment of the 
present invention; 

Figure 10 is a flow chart illustrating a method for performing incremental placement 
according to an embodiment of the present invention; 

Figure 11 illustrates fanin, fanout, and sibling relationship move proposals according to 
an embodiment of the present invention; 



Figure 12 illustrates an exemplary critical vector move proposal according to an 
embodiment of the present invention; 

Figure 13 illustrates horizontal and vertical cut-lines used for local congestion estimation 
according to an embodiment of the present invention; 

Figure 14 is a flow chart illustrating a method for performing incremental placement 
utilizing directed hill-climbing according to an embodiment of the present invention; 

Figure 15 illustrates a component trapped in a local minimum according to an embodiment
of the present invention; and

Figure 16 illustrates basin-filling according to an embodiment of the present invention.



DETAILED DESCRIPTION

Figure 1 is a flow chart that illustrates a method for designing a system according to an
embodiment of the present invention. The method may be performed with the assistance of an
EDA tool, for example. At 101, synthesis is performed. Synthesis includes generating a logic
design of the system to be implemented by a target device. According to an embodiment of the
present invention, synthesis generates an optimized logical representation of the system from a
Hardware Description Language (HDL) design definition. The optimized logical representation
of the system may include a representation that includes a minimized number of logic gates and
logic elements required for the system. Alternatively, the optimized logical representation of the
system may include a representation that has a reduced depth of logic and that generates a lower
signal propagation delay.

Figure 2 illustrates an exemplary target device 200 utilizing FPGAs according to an
embodiment of the present invention. The present invention may be used to design a system onto
the target device 200. According to one embodiment, the target device 200 is a chip having a
hierarchical structure that may take advantage of wiring locality properties of circuits formed
therein. The lowest level of the hierarchy is a logic element (LE) (not shown). An LE is a small
unit of logic providing efficient implementation of user logic functions. According to one
embodiment of the target device 200, an LE may include a 4-input lookup table (LUT) with a
configurable flip-flop.

The target device 200 includes a plurality of logic-array blocks (LABs). Each LAB is
formed from 10 LEs, LE carry chains, LAB control signals, LUT chain, and register chain
connection lines. LUT chain connections transfer the output of one LE's LUT to the adjacent LE
for fast sequential LUT connections within the same LAB. Register chain connection lines
transfer the output of one LE's register to the adjacent LE's register within a LAB. LABs are
grouped into rows and columns across the target device 200. A first column of LABs is shown as
210 and a second column of LABs is shown as 211.

The target device 200 includes memory blocks (not shown). The memory blocks may be,
for example, dual port random access memory (RAM) blocks that provide dedicated true
dual-port, simple dual-port, or single port memory up to various bits wide at up to various
frequencies. The memory blocks may be grouped into columns across the target device in between
selected LABs or located individually or in pairs within the target device 200.

The target device 200 includes digital signal processing (DSP) blocks (not shown). The
DSP blocks may be used to implement multipliers of various configurations with add or subtract
features. The DSP blocks include shift registers, multipliers, adders, and accumulators. The DSP
blocks may be grouped into columns across the target device 200.

The target device 200 includes a plurality of input/output elements (IOEs) (not shown).
Each IOE feeds an I/O pin (not shown) on the target device 200. The IOEs are located at the end
of LAB rows and columns around the periphery of the target device 200. Each IOE includes a
bidirectional I/O buffer and a plurality of registers for registering input, output, and output-enable
signals. When used with dedicated clocks, the registers provide performance and interface
support with external memory devices.

The target device 200 includes LAB local interconnect lines 220-221 that transfer signals
between LEs in the same LAB. The LAB local interconnect lines are driven by column and row
interconnects and LE outputs within the same LAB. Neighboring LABs, memory blocks, or DSP
blocks may also drive the LAB local interconnect lines 220-221 through direct link connections.

The target device 200 also includes a plurality of row interconnect lines ("H-type wires")
230 that span fixed distances. Dedicated row interconnect lines 230, which include H4 231, H8
232, and H24 233 interconnects, route signals to and from LABs, DSP blocks, and memory
blocks within the same row. The H4 231, H8 232, and H24 233 interconnects span a distance of
up to four, eight, and twenty-four LABs respectively, and are used for fast row connections in a
four-LAB, eight-LAB, and twenty-four-LAB region. The row interconnects 230 may drive and
be driven by LABs, DSP blocks, RAM blocks, and horizontal IOEs.

The target device 200 also includes a plurality of column interconnect lines ("V-type
wires") 240 that operate similarly to the row interconnect lines 230. The column interconnect
lines 240 vertically route signals to and from LABs, memory blocks, DSP blocks, and IOEs.
Each column of LABs is served by a dedicated column interconnect, which vertically routes
signals to and from LABs, memory blocks, DSP blocks, and IOEs. These column interconnect
lines 240 include V4 241, V8 242, and V16 243 interconnects that traverse a distance of four,
eight, and sixteen blocks respectively, in a vertical direction.

Figure 2 illustrates an exemplary embodiment of a target device. It should be appreciated
that a system may include a plurality of target devices, such as that illustrated in Figure 2,
cascaded together. It should also be appreciated that the target device may include programmable
logic devices arranged in a manner different than that on the target device 200. A target device
may also include components other than those described in reference to the target device 200.
Thus, while the invention described herein may be utilized on the architecture described in Figure
2, it should be appreciated that it may also be utilized on different architectures, such as those
employed by Altera® Corporation in its APEX™ and Mercury™ families of chips and those
employed by Xilinx®, Inc. in its Virtex™ and Virtex™ II lines of chips.

Figure 3 illustrates a LAB or clustered logic block 300 according to an embodiment of
the present invention. The LAB 300 may be used to implement any of the LABs shown in Figure
2. LEs 301-303 illustrate a first, second, and tenth LE in the LAB 300. The LEs 301-303 each
have a 4-input lookup table 311-313, respectively, and a configurable register 321-323,
respectively, connected at its output. The LAB 300 includes a set of input pins 340 and a set of
output pins 350 that connect to the general-purpose routing fabric so that the LAB can communicate
with other LABs. The inputs to lookup tables 311-313 can connect to any one of the input pins
340 and output pins 350 using the appropriate configuration bits for each of the multiplexers 330.
The number of LEs, n_E, input pins, n_I, and output pins, n_O, in a LAB impose important
architectural constraints on a system. In addition, since a single clock line 361 and a single
asynchronous set/reset line 362 are attached to each configurable register 321-323, the
configurable registers 321-323 must be clocked by the same signal and initialized by the same
signal. The number of clock lines available in a LAB is represented by n_C. The number of reset
lines available in a LAB is represented by n_R.

Referring back to Figure 1, at 102, the optimized logical design of the system is mapped.
Mapping includes determining how to implement components such as logic gates and logic 
elements in the optimized logic representation with specific resources on a target device. 
According to an embodiment of the present invention, a netlist is generated from mapping. The 
netlist illustrates how the resources of the target device are utilized to implement the system. The 
netlist may, for example, include a representation of the components on the target device and how 
the components are connected. 

At 103, the mapped logical system design is placed. Placement includes fitting the 
system on the target device by determining which resources on the target device are to be used for
specific logic gates, logic elements, and connections between components. The placement 
procedure may be performed by a placer in an EDA tool that utilizes placement algorithms. 
According to an embodiment of the present invention, a user (designer) may provide input to the 
placer by specifying placement constraints. The constraints may include defining logic regions 
that group certain components of a system together. The components may be, for example, digital
logic, memory devices, or other components. The size of the logic regions may be determined by 
the user or by a sizing method. The placement of the logic regions may be determined by the user 
or by a placement method. 

At 104, layout-driven optimizations are performed. According to an embodiment of the 
present invention, routing delays for the connections on the netlist are estimated by calculating a 
fastest possible route. Timing-driven netlist optimization techniques may be applied to perturb 
the netlist to reduce the critical path(s). The netlist may be perturbed by the EDA tool performing 
the synthesis, mapping and placement. Alternatively, the netlist may be perturbed by a user of 
the EDA tool, or by a third party. Perturbing the netlist may include adding, deleting, or moving
components. 

According to an embodiment of the present invention, optimization of the layout of the
system may be achieved by applying Shannon's Decomposition Theorem to critical sections of
the system. Consider an n-input function f(x_0, ..., x_n). Shannon's Decomposition Theorem
allows the n-input function to be expressed in the following manner.

$$ f(x_0, \ldots, x_i, \ldots, x_n) = \bar{x}_i \, f(x_0, \ldots, 0, \ldots, x_n) + x_i \, f(x_0, \ldots, 1, \ldots, x_n) \tag{1} $$

In this embodiment, critical components of the system used for processing a critical
signal, x_i, may be identified and expanded. The critical signal may be, for example, a signal that
impacts the processing of many other signals in the system or a signal that may affect the timing
of the system if the propagation delay of that signal is increased. In this embodiment, a critical
path is a path from source to sink that carries critical signals through components (vertices) and
wires (edges). Expansion includes making duplicate copies of the components. The duplicate
copies of the components generate pre-computed function values dependent on possible values of
the critical signal. The pre-computed function values may be determined for x_i = 0 and x_i = 1. An
appropriate pre-computed function value may be selected in response to the critical signal when it
arrives. According to an embodiment of the present invention, preferred locations are identified
for the duplicate copies of the components, and the locations assigned to components of the
existing system from the placement procedure are identified as preferred locations for those
components.
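
As a minimal illustration of relationship (1), the following Python sketch pre-computes both cofactors of a small Boolean function and then selects between them once the critical input is known. The function `f`, the helper `shannon_select`, and the choice of input 1 as the critical signal are hypothetical names and values introduced only for this example.

```python
from itertools import product

def f(x0, x1, x2):
    # Hypothetical 3-input function standing in for logic fed by a critical signal.
    return (x0 and x1) or ((not x1) and x2)

def shannon_select(func, args, i):
    """Evaluate func through a Shannon expansion about input i.

    Both cofactors are computed without looking at args[i]; the late-arriving
    value of args[i] is used only to pick one of the two pre-computed results.
    """
    lo = list(args)
    lo[i] = False                      # cofactor with x_i = 0
    hi = list(args)
    hi[i] = True                       # cofactor with x_i = 1
    f0, f1 = func(*lo), func(*hi)
    return f1 if args[i] else f0       # 2:1 selector driven by x_i

# The expansion agrees with direct evaluation for every input pattern.
all_match = all(
    bool(shannon_select(f, bits, 1)) == bool(f(*bits))
    for bits in product([False, True], repeat=3)
)
```

In hardware terms, the two cofactor evaluations correspond to the duplicate copies of the components, and the final conditional corresponds to the select logic driven by the critical signal.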

Figure 4 is a flow chart illustrating a method for performing layout-driven optimization
according to an embodiment of the present invention. The method shown in Figure 4 may be
used to implement 104 shown in Figure 1. At 401, timing analysis is performed on the system.
For each edge, e, that serves as a connection in the system, a criticality value crit(e) is assigned
based on the timing analysis. The criticality value of a connection indicates how significantly the
connection impacts the processing at other components in the system. According to an
embodiment of the present invention, the system on the target device may be represented by a
directed acyclic graph, G(V,E), where V is a group of vertices representing the components or
combinational elements, and E is a group of edges representing routing connections between
vertices in the system. According to one embodiment, an ε-graph G_ε(V_ε,E_ε) is a subgraph of
the combinational graph G(V,E), where the edge set E_ε is the set of edges that have criticality
values within a value ε of being the most critical, and V_ε is the set of vertices that have an
adjacent edge having a criticality value within ε of being the most critical.
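
The construction of the near-critical subgraph can be sketched as follows. This is only an illustrative reading of the definition above: the dictionary-based edge representation, the function name `epsilon_subgraph`, and the sample criticality values are assumptions made for the example.

```python
def epsilon_subgraph(crit, eps):
    """Return (V_eps, E_eps) for a mapping of edge -> criticality.

    crit maps each directed edge (u, v) to its criticality value. An edge is
    kept when its criticality is within eps of the most critical edge, and a
    vertex is kept when it touches at least one kept edge.
    """
    most_critical = max(crit.values())
    e_eps = {e for e, c in crit.items() if most_critical - c <= eps}
    v_eps = {v for edge in e_eps for v in edge}
    return v_eps, e_eps

# Hypothetical criticalities for three connections.
crit = {("a", "b"): 1.0, ("b", "c"): 0.9, ("a", "d"): 0.5}
v_eps, e_eps = epsilon_subgraph(crit, eps=0.2)
```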

At 402, the critical signals in the system are identified. According to an embodiment of
the present invention, a cost function is utilized. The cost function quantifies the number of critical
or near-critical paths that a particular vertex affects. Illustratively, this quantity is denoted with a
label named "cpcount", or critical path count, for each vertex. The cpcount identifiers for each
vertex are initially set to zero. For each sink vertex in G_ε, the following procedure is performed.

procedure traceback(sink)
begin
    u = sink
    while u ≠ ∅
        cpcount(u) = cpcount(u) + 1
        next = ∅
        foreach e_wu ∈ FANIN(u)
            choose fanin with maximum crit(e_wu) and assign next = w
        end for
        u = next
    end loop
end procedure

This procedure takes a sink vertex and traverses the most critical fanins (FANIN)
backwards to find a single critical path that involves the sink vertex. The criticality value is used
to determine the most critical fanin at each vertex. Each vertex along the path traced backwards
has its cpcount identifier incremented. After this procedure has been performed for all sink
vertices, the vertices that have the higher cpcount values are determined to affect the larger
numbers of critical sinks. It should be appreciated that this procedure is heuristic in nature as
there may be several different near-critical paths that affect a sink vertex, instead of the single
path that is traced backwards. Nevertheless, this procedure is an efficient and effective method
for identifying vertices that affect the largest number of critical paths.
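
The traceback heuristic can be sketched in Python as shown below. The data layout (a `fanin` dictionary of predecessor lists and a `crit` dictionary keyed by edge) and the sample graph are assumptions made for illustration; ties between equally critical fanins are broken arbitrarily by `max`.

```python
from collections import defaultdict

def count_critical_paths(fanin, crit, sinks):
    """Trace one most-critical path back from each sink and count visits.

    fanin[v] lists the predecessor vertices of v; crit[(w, u)] is the
    criticality of the edge from w to u.
    """
    cpcount = defaultdict(int)
    for sink in sinks:
        u = sink
        while u is not None:
            cpcount[u] += 1
            preds = fanin.get(u, [])
            # Follow the single most critical fanin edge; stop at a source.
            u = max(preds, key=lambda w: crit[(w, u)]) if preds else None
    return dict(cpcount)

# Hypothetical graph: x feeds a and b; a feeds sinks s1 and s2; b feeds s1.
fanin = {"s1": ["a", "b"], "s2": ["a"], "a": ["x"], "b": ["x"]}
crit = {("a", "s1"): 0.9, ("b", "s1"): 0.4, ("a", "s2"): 0.8,
        ("x", "a"): 1.0, ("x", "b"): 0.3}
cpcount = count_critical_paths(fanin, crit, sinks=["s1", "s2"])
```

In this sample, both tracebacks pass through `a` and `x`, so those vertices receive the highest counts, while `b` is never visited because its edge into `s1` is not the most critical fanin.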

Figure 5 illustrates the critical path traceback procedure performed on exemplary vertices
and edges in the system according to an embodiment of the present invention. Nodes 510-520
represent vertices in the system. The solid arrow lines represent edges that connect the nodes.
Each edge has its corresponding criticality value labeled next to it. The dashed arrow lines
represent tracebacks from critical sink vertices. In this example, nodes 516 and 518 each affect four
critical sinks, and thus the signals associated with nodes 516 and 518 are determined to have an
associated cpcount value of 4. It should be appreciated that signals having an associated cpcount
value over a predetermined threshold value may be designated as being critical signals and be
prioritized in order of their degree of criticality.

According to an embodiment of the present invention, the critical signals in the system
are sorted and prioritized according to their associated cpcount value. The critical signal with the
highest cpcount value is designated to be the most critical of the critical signals. The critical
signal with the lowest cpcount value over the predetermined threshold value is designated the
least critical of the critical signals. According to an alternate embodiment of the present
invention, a cost function is utilized to determine the degree of criticality of the critical signals.
In this embodiment, the cost function takes into account the cpcount value and other criteria.

At 403, components associated with a critical signal are identified. According to an
embodiment of the present invention, to identify all critical vertices that are affected by a critical
signal, the transitive fanouts of the critical signal are examined.

Consider the following illustrative example. Figure 6 illustrates a portion of
combinational graph G(V,E) where signal x is identified as being a critical signal. A vertex v is
a transitive fanout of x if there exists a path from x to v in the subgraph G_ε. This may be denoted
as v ∈ TF_ε(x). In Figure 6, the transitive fanouts of signal x in G_ε are encapsulated in the rectangle
labeled 610. The signal x affects two critical signals y(..., x, ...) and z(..., x, ...).
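
A transitive fanout set of this kind can be collected with an ordinary graph traversal. The sketch below assumes the near-critical subgraph is available as a `fanout` dictionary of successor lists; that representation and the sample graph are hypothetical.

```python
def transitive_fanout(fanout, x):
    """Return every vertex reachable from x, excluding x itself."""
    seen, stack = set(), [x]
    while stack:
        u = stack.pop()
        for v in fanout.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

# Hypothetical subgraph: x drives a and b, which in turn drive y and z.
fanout = {"x": ["a", "b"], "a": ["y"], "b": ["y", "z"], "w": ["z"]}
tf_x = transitive_fanout(fanout, "x")
```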

Referring back to Figure 4, at 404, it is determined whether sufficient slack exists for
performing duplication on the components associated with the critical signal. While applying
Shannon's Decomposition Theorem may be beneficial for signals emanating from x, the side
effects of the operation need to be considered. Consider the fanins labeled i_1-i_5 in Figure 6. After
applying Shannon's Decomposition Theorem to the components associated with the critical
signals, signals downstream from these components experience an extra level of logic delay due
to select logic added at the sink nodes. This extra level of logic delay is acceptable as long as the
slack on the connections i_1-i_5 is greater than the amount of delay introduced by the select logic
and the routing delay needed to connect to the select logic. The side-input set I may be used to
represent the fanin edges of TF_ε(x) whose source vertex is not an intermediate variable in TF_ε(x).
I includes all external input edges to TF_ε(x). According to an embodiment of the present
invention, Shannon's Decomposition Theorem may be applied as long as the following condition
is satisfied.

$$ \forall i \in I, \quad \text{slack}(i) > \text{selector delay} + \text{routing delay to selector} \tag{2} $$

If sufficient slack exists, control proceeds to 405. If sufficient slack does not exist,
control returns to 403 where components associated with a next critical signal are identified.
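
The slack test at 404 amounts to a single comparison per side input. The following sketch assumes slack and delay values are available as integers in picoseconds; the function name, the units, and the sample numbers are hypothetical.

```python
def decomposition_allowed(side_input_slack, selector_delay, routing_delay):
    """Check that every side input can absorb the added selector stage.

    side_input_slack maps each side-input edge to its timing slack.
    """
    required = selector_delay + routing_delay
    return all(slack > required for slack in side_input_slack.values())

# Hypothetical slack values (picoseconds) for five side inputs.
slack_ps = {"i1": 900, "i2": 450, "i3": 700, "i4": 1200, "i5": 520}
ok = decomposition_allowed(slack_ps, selector_delay=300, routing_delay=100)
blocked = decomposition_allowed(slack_ps, selector_delay=300, routing_delay=200)
```

With a combined penalty of 400 ps every side input has enough slack, whereas a 500 ps penalty is rejected because one side input has only 450 ps available.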

At 405, copies of the components associated with the critical signal are generated. A first
and second copy of the vertices identified in the transitive fanout of the critical signal are made.
Referring to the example illustrated in Figure 6, the first copy of the vertices will be used to
evaluate y and z for x = 0. The second copy of the vertices will be used to evaluate y and z for
x = 1. Consider a vertex v ∈ TF_ε(x). A duplicate version is required to evaluate y and z for
x = 0. This duplicated vertex is denoted as v_0. A duplicate version is required to evaluate y and z for
x = 1. This duplicated vertex is denoted as v_1. Figure 7 illustrates the duplication of components
associated with a critical signal according to an embodiment of the present invention. Vertices in
rectangle area 710 represent the first copy of the components. Vertices in rectangle area 720
represent the second copy of the components.

At 405, edges are also added to connect the appropriate vertices together. For each
vertex v ∈ TF_ε(x), the following procedure is performed to generate the required edges. For
every edge e_uv ∈ FANIN(v), if u ∈ TF_ε(x), create an edge from u_0 to v_0 and an edge from u_1 to v_1.
Referring to the example in Figure 7, this procedure wires together the intermediate signals used
to compute y and z for x = 0 and x = 1. For every edge e_uv ∈ FANIN(v), if u ∉ TF_ε(x), create a
new edge from u to v_0 and a new edge from u to v_1. Referring to the example in Figure 7, this
procedure wires the input signals into the components that compute y(...0...), y(...1...),
z(...0...), and z(...1...).

At 405, a selector is also added to select an appropriate output from the first and second
copies. The selector selects the appropriate output in response to the critical signal on which the
decomposition was based. Referring to the example in Figure 7, the selectors are shown as 730
and 731.
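
The duplication, rewiring, and selector insertion performed at 405 can be sketched as a transformation on a fanin map. The representation below is an assumption for illustration only: vertices are strings, the copies are named with `_0` and `_1` suffixes, selectors with a `_sel` suffix, and the critical input is tied to a constant (`const0` or `const1`) in each copy, a detail the text implies but does not spell out.

```python
def duplicate_cone(fanin, tf, x, outputs):
    """Build the fanin map of the duplicated logic for critical signal x.

    fanin[v] lists the sources of v, tf is the transitive fanout set of x,
    and outputs are the vertices in tf whose values need a selector.
    """
    new_fanin = {}
    for v in tf:
        for k in (0, 1):
            sources = []
            for u in fanin[v]:
                if u == x:
                    sources.append(f"const{k}")   # x fixed to 0 or 1 in this copy
                elif u in tf:
                    sources.append(f"{u}_{k}")    # internal edge: u_k -> v_k
                else:
                    sources.append(u)             # side input feeds both copies
            new_fanin[f"{v}_{k}"] = sources
    for out in outputs:
        # Selector chooses between the two pre-computed values using x.
        new_fanin[f"{out}_sel"] = [x, f"{out}_0", f"{out}_1"]
    return new_fanin

# Hypothetical cone: a = g(x, i1), y = h(a, i2).
fanin = {"a": ["x", "i1"], "y": ["a", "i2"]}
expanded = duplicate_cone(fanin, tf={"a", "y"}, x="x", outputs=["y"])
```

After the transformation, the critical signal reaches the output through the selector alone, which is the source of the delay reduction described above.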

The procedure described involves the duplication of each vertex in TF_ε(x). It should be
appreciated that this set size may be controlled by redefining the critical transitive fanout set.
According to an embodiment of the present invention, each vertex v ∈ TF_ε(x) is associated with
a label l(v) that is set to a maximum number of logic levels between x and v. The set TF_ε(x,D)
represents all vertices v where there exists a path from x to v and l(v) < D. Figure 8 illustrates
logic levels relative to signal x for the purposes of controlling vertex duplication according to an
embodiment of the present invention. The value of D controls the tradeoff between the amount of
duplication allowed and the number of levels of logic delay removed from critical paths. It
should be appreciated that TF_ε(x) may be replaced with TF_ε(x,D) for 403-406. According to one
embodiment, D has the value 3.

Referring back to Figure 4, at 406, unused resources from the original system design are
removed. According to an embodiment of the present invention, unused components and unused
wires for routing connections associated with the critical signal are removed. When the transitive
fanout of x, v ∈ TF_ε(x), is duplicated, copies v_0 and v_1 are generated for every v. Referring to
the example shown in Figure 6, some of the non-duplicated components are no longer needed
because the first and second copies serve to produce the functions y and z. However, if any
vertex v ∈ TF_ε(x) is used as an input to another function nc(...v...) such that
nc ∉ TF_ε(x), then the vertex v cannot be removed or deleted. Figure 9 illustrates exemplary
components associated with a critical signal x, where some components may be removed and
others may not according to an embodiment of the present invention. Thus, a vertex v ∈ TF_ε(x)
may be removed only if TF(v) ⊆ TF_ε(x).
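
The removal condition can be checked directly from the subset test stated above. In this sketch the full netlist is assumed to be available as a `fanout` dictionary of successor lists, and `tf_eps` is the already computed near-critical transitive fanout set; both names and the sample graph are hypothetical.

```python
def removable_vertices(fanout, tf_eps):
    """Return the vertices v in tf_eps whose full fanout cone stays in tf_eps."""
    def reach(v):
        # Every vertex reachable from v in the full graph, i.e. TF(v).
        seen, stack = set(), [v]
        while stack:
            u = stack.pop()
            for w in fanout.get(u, []):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    return {v for v in tf_eps if reach(v) <= tf_eps}

# Hypothetical netlist: b also feeds a non-critical function nc.
fanout = {"x": ["a", "b"], "a": ["y"], "b": ["y", "nc"]}
removable = removable_vertices(fanout, tf_eps={"a", "b", "y"})
```

Here `b` must be kept because it still drives `nc`, which lies outside the duplicated region, while `a` and `y` are candidates for removal.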

At 407, the design for the system is evaluated to determine whether vertex collapsing
may be performed. Depending on the architecture for the target device, multiple vertices may be
implemented with a single vertex. According to an embodiment of the present invention, the
target device implements a logic element having a 4-input lookup table. Thus, in this
embodiment, vertices may be arbitrarily collapsed into a single vertex as long as the new vertex
requires four or fewer inputs. Collapsing multiple vertices into a single vertex reduces the
number of levels of logic delay for a signal and recovers some of the area utilized for component
duplication.
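
The input-count test for collapsing can be sketched as follows. The sketch checks only the number of distinct external inputs of a candidate group and assumes the group produces a single output; the function name, the `fanin` representation, and the sample vertices are hypothetical.

```python
def can_collapse(fanin, group, max_inputs=4):
    """True if the vertices in group need at most max_inputs external inputs."""
    external = {u for v in group for u in fanin.get(v, []) if u not in group}
    return len(external) <= max_inputs

# Hypothetical pair: a = g(i1, i2), b = h(a, i3, i4).
fanin = {"a": ["i1", "i2"], "b": ["a", "i3", "i4"]}
fits = can_collapse(fanin, {"a", "b"})          # needs i1..i4 only

# Adding a fifth external input exceeds what a 4-input lookup table can take.
fanin_wide = {"a": ["i1", "i2"], "b": ["a", "i3", "i4", "i5"]}
too_wide = can_collapse(fanin_wide, {"a", "b"})
```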

Referring back to Figure 1, at 105, incremental placement is performed. The changes to
the netlist generated from layout-driven optimization are placed on the layout of the existing
system placed at 103. Incremental placement involves evaluating resources on a target device
such as LABs that have architectural violations or illegalities from layout-driven optimizations.
Incremental placement attempts to perturb the preferred locations as little as possible to ensure
that the final placement respects all architectural constraints. Incremental placement attempts to
identify non-critical LEs that may be moved from their preferred locations to resolve architectural
violations in order that truly critical elements may stay at their preferred locations. Incremental
placement may be performed by an incremental placement engine (not shown) in the EDA tool
that utilizes incremental placement algorithms.

In performing incremental placement, an architectural description of the target device, A,
and a netlist, N(E,C), that includes a set of logic elements, E, and a set of connections, C, are
processed. Each element, e, is associated with a preferred physical location, (p_x(e), p_y(e)).
According to an embodiment of the present invention, all atoms of the netlist have a preferred
location. Incremental placement generates a set of mapped locations, M, for the logic elements
in N. Incremental placement tries to find a mapping from preferred locations to mapped
locations, P → M, such that the mapped locations are architecturally feasible as well as being
minimally disruptive. The definition of minimal disruption depends on the goal of netlist
optimization.

According to an embodiment of the present invention, the goal of netlist optimization is
to optimize timing of the system. In this embodiment, T(S) represents an estimate of the critical
path delay if all logic elements in E are mapped to (s_x(e), s_y(e)). The estimate may ignore the
legality of locations and may be computed assuming a best case route is possible for each
connection. In this example, P → M is minimally disruptive if incremental placement minimizes
{T(M) - T(P)}. Any logic element can be moved from its preferred location as long as it does not
degrade the critical path. According to one embodiment, routing area is also tracked to control
excessive routing congestion. In this embodiment, A(S) represents the routing area consumed if
the logic elements are mapped to (s_x(e), s_y(e)). Minimal disruptiveness is satisfied by minimizing
the expression shown below.

$$ \{T(M) - T(P)\} + \{A(M) - A(P)\} \tag{3} $$

Figure 10 is a flow chart illustrating a method for performing incremental placement
according to an embodiment of the present invention. The method described in Figure 10 may be
used to perform incremental placement, shown as 105 in Figure 1. At 1001, proposed moves for
all LEs in a LAB having architectural violations are generated. According to an embodiment of
the present invention, proposed moves may include a move-to-fanin, move-to-fanout,
move-to-sibling, move-to-neighbor, move-to-space, a move towards a critical vector, and other
moves. A move-to-fanin involves moving an LE to a LAB that is a fanin of the LE. A move-to-fanout
involves moving an LE to a LAB that is a fanout of the LE. A move-to-sibling involves moving
an LE to a LAB that is a fanout of a LAB that fans in to the LAB of the LE.

Figure 11 illustrates examples of a move-to-fanin, move-to-fanout, and move-to-sibling. 
When a first LE in a first LAB transmits a signal to a second LE in a second LAB, the first LAB 
is said to be a fanin of the second LE. When a first LE in a first LAB receives a signal from a 
second LE in a second LAB, the first LAB is said to be a fanout of the second LE. When a first 
LE from a first LAB receives a signal from a second LE from a second LAB that also transmits to 
a third LE in a third LAB, the first LAB and the third LAB are said to be siblings. Blocks
1101-1109 illustrate a plurality of LABs. Each of the LABs 1101-1109 has a number of shown LEs.
A plurality of arrows 1111-1118 are shown to illustrate the direction of a signal transmitted
between LEs. Relative to LAB 1106, LABs 1101-1104 are considered fanins, LABs 1105 and
1107 are considered siblings, and LABs 1108 and 1109 are considered fanouts. 
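
By way of illustration only, the fanin, fanout, and sibling relationships described above can be sketched in Python. The function name `classify_labs` and the particular arrow list are assumptions made for this sketch. The arrows are chosen to be consistent with the description of Figure 11, but the actual connections 1111-1118 are not enumerated in the text.

```python
from collections import defaultdict


def classify_labs(connections, lab):
    """Return (fanins, fanouts, siblings) of `lab`.

    `connections` holds one (source_lab, sink_lab) pair per LE-to-LE signal.
    """
    drives = defaultdict(set)      # LAB -> LABs it transmits to
    driven_by = defaultdict(set)   # LAB -> LABs it receives from
    for src, dst in connections:
        if src != dst:             # signals inside one LAB are ignored
            drives[src].add(dst)
            driven_by[dst].add(src)

    fanins = set(driven_by[lab])
    fanouts = set(drives[lab])
    # Siblings: other LABs driven by any LAB that fans in to `lab`.
    siblings = set()
    for f in fanins:
        siblings |= drives[f]
    siblings.discard(lab)
    return fanins, fanouts, siblings


# Hypothetical arrows, consistent with the Figure 11 description.
ARROWS = [
    (1101, 1106), (1102, 1106), (1103, 1106), (1104, 1106),
    (1101, 1105), (1104, 1107),
    (1106, 1108), (1106, 1109),
]
fanins, fanouts, siblings = classify_labs(ARROWS, 1106)
print(sorted(fanins), sorted(siblings), sorted(fanouts))
```

The sets returned for a LAB are the candidate destinations for move-to-fanin, move-to-fanout, and move-to-sibling proposals for an LE in that LAB.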

Proposed moves may also include move-to-neighbor, move-to-space, and move towards 
critical vector. A move-to-neighbor involves moving an LE to an adjacent LAB. A move-to- 
space involves a move to any random free LE location in a target device. A move towards 
critical vector involves moving an LE towards a vector that is computed by summing the 
directions of all critical connections associated with the moving LE. Figure 12 illustrates an 
exemplary critical vector 1201. Vector 1201 is the critical vector of LE 1211, which has critical
connections to LEs 1212 and 1213, and a non-critical connection with LE 1214. 
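
As a minimal sketch of the critical vector computation, the following Python function sums the unit direction of each critical connection of an LE. The coordinates, the criticality threshold, and the choice of unit vectors (rather than raw displacements) are assumptions made for illustration and are not specified in the description.

```python
import math


def critical_vector(le_xy, connections, crit_threshold=0.9):
    """Sum the unit directions of the critical connections of one LE.

    `connections` is a list of ((x, y), criticality) entries, one per LE
    connected to the moving LE. The threshold is a hypothetical cutoff.
    """
    x0, y0 = le_xy
    vx = vy = 0.0
    for (x, y), crit in connections:
        dist = math.hypot(x - x0, y - y0)
        if crit < crit_threshold or dist == 0.0:
            continue                 # non-critical or co-located: no pull
        vx += (x - x0) / dist
        vy += (y - y0) / dist
    return vx, vy


# Hypothetical positions for LE 1211 and its neighbors 1212, 1213, 1214.
vector_1201 = critical_vector(
    (2, 2),
    [((5, 2), 1.0),    # critical connection to LE 1212
     ((2, 6), 0.95),   # critical connection to LE 1213
     ((0, 0), 0.1)],   # non-critical connection to LE 1214
)
print(vector_1201)
```

In this example the non-critical connection contributes nothing, so the resulting vector points between the two critical neighbors.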

Referring back to Figure 10, at 1002, a current placement of LEs in a LAB with
architectural violations and proposed moves of the LEs in the LAB are evaluated by a cost
function. The cost function may include parameters which measure the legality of a LAB (cluster
legality cost), timing (timing cost), and an amount of routing resources that is required for a
placement (wirelength cost). According to an embodiment of the present invention, the cost
function guides the reduction of architectural violations while ensuring minimal disruption. This
cost function, C, is illustrated with the relationship shown below.

$C = K_L \cdot \mathrm{ClusterCost} + K_T \cdot \mathrm{TimingCost} + K_W \cdot \mathrm{WirelengthCost}$ (4)

$K_L$, $K_T$, and $K_W$ represent weighting coefficients that normalize the contributions of each
parameter. It should be appreciated that other parameters may be used in addition to or in place
of the parameters described.

The cluster legality cost is a cost associated with each LAB $CL_i$. This cost may be
represented as shown below.

$$
\begin{aligned}
\mathrm{ClusterCost}(CL_i) ={} & kE_i \cdot \mathrm{legality}(CL_i, n_E) \\
 +{} & kI_i \cdot \mathrm{legality}(CL_i, n_I) \\
 +{} & kR_i \cdot \mathrm{legality}(CL_i, n_R) \\
 +{} & kO_i \cdot \mathrm{legality}(CL_i, n_O) \\
 +{} & kC_i \cdot \mathrm{legality}(CL_i, n_C)
\end{aligned}
\qquad (5)
$$

The $\mathrm{legality}(CL_i, \cdot)$ function returns a measure of legality for a particular constraint. A
value of 0 indicates legality, while any positive value is proportional to the amount by which the
constraint has been violated. Functions $\mathrm{legality}(CL_i, n_E)$, $\mathrm{legality}(CL_i, n_I)$,
$\mathrm{legality}(CL_i, n_O)$, $\mathrm{legality}(CL_i, n_R)$, and $\mathrm{legality}(CL_i, n_C)$ evaluate if LAB $CL_i$ has a
feasible number of logic elements, inputs, outputs, reset lines and clock lines, respectively.
According to an embodiment of the present invention, the weighting coefficients $kE_i$, $kI_i$, $kO_i$,
$kR_i$, and $kC_i$ are all initially set to 1 for every LAB $CL_i$ in the target device.
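
The cluster legality cost of relationship (5) may be sketched as follows. This is an illustration under stated assumptions: the dictionary layout, the constraint names, and the per-LAB limits are hypothetical, and the legality measure is taken to be the amount by which a resource count exceeds its limit.

```python
CONSTRAINTS = ("elements", "inputs", "outputs", "resets", "clocks")


def legality(used, limit):
    """Return 0 when the constraint is met, else the size of the overshoot."""
    return max(used - limit, 0)


def cluster_cost(usage, limits, weights=None):
    """Weighted sum of the five legality terms for one LAB."""
    if weights is None:
        # All weighting coefficients start at 1, as in the description.
        weights = {name: 1.0 for name in CONSTRAINTS}
    return sum(
        weights[name] * legality(usage[name], limits[name])
        for name in CONSTRAINTS
    )


# Hypothetical architectural limits and resource usage for one LAB.
LIMITS = {"elements": 10, "inputs": 22, "outputs": 10, "resets": 2, "clocks": 2}
usage = {"elements": 11, "inputs": 25, "outputs": 8, "resets": 2, "clocks": 1}
print(cluster_cost(usage, LIMITS))
```

Here the LAB holds one LE too many and uses three inputs too many, so only those two terms contribute to the cost.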

The timing cost associated with a placement may be represented as shown below.

$\mathrm{TimingCost} = TC_{VPR} + k_{DAMP} \cdot TC_{DAMP}$ (6)

The first parameter, $TC_{VPR}$, is based upon the cost used by a Versatile Place and
Route (VPR) placer. This cost may be represented with the following relationship.

$TC_{VPR} = \sum_{c} \mathrm{crit}(c) \cdot \mathrm{delay}(c)$ (7)

This function encourages critical connections to reduce delay while allowing non-critical
connections to optimize wirelength and other optimization criteria.

The second parameter, $TC_{DAMP}$, operates as a damping component of the timing cost
function and can be represented with the following relationships.

$TC_{DAMP} = \sum_{c} \max(\mathrm{delay}(c) - \mathrm{maxdelay}(c),\ 0.0)$ (8)

$\mathrm{maxdelay}(c) = \mathrm{delay}(c) + \alpha \cdot \mathrm{slack}(c)$ (9)

The damping component penalizes any connection c whose delay(c) exceeds a maximum
value maxdelay(c). This allows arbitrary moves to be made along a plateau defined by the
maximum delays. The maxdelay values may be updated every time a timing analysis of the
system is executed. The maxdelay values are controlled by the slack on the connection
considered. The parameter $\alpha$ determines how much of a connection's slack will be allocated to
the delay growth of the connection. Thus, the plateau is defined by the connection slack so that
connections with large amounts of slack are free to move large distances in order to resolve
architectural violations, while connections with small slack values are relatively confined.
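
The timing cost of relationships (6)-(9) may be sketched in Python as shown below. The connection names, delay values, and the damping weight are hypothetical, and the slack fraction of relationship (9) is called `alpha` here. The sketch assumes the maxdelay values are captured once, when timing analysis runs, and then held fixed while moves are evaluated.

```python
def max_delays(delays, slacks, alpha):
    """Relationship (9): per-connection delay ceiling set at timing analysis."""
    return {c: delays[c] + alpha * slacks[c] for c in delays}


def timing_cost(delays, crit, maxdelay, k_damp):
    """Relationship (6): VPR-style term plus weighted damping term."""
    tc_vpr = sum(crit[c] * delays[c] for c in delays)                  # (7)
    tc_damp = sum(max(delays[c] - maxdelay[c], 0.0) for c in delays)   # (8)
    return tc_vpr + k_damp * tc_damp


# Hypothetical state at the last timing analysis.
analysis_delays = {"a": 2.0, "b": 1.0}
slacks = {"a": 0.0, "b": 4.0}          # "a" is critical, "b" has slack
crit = {"a": 1.0, "b": 0.25}
ceiling = max_delays(analysis_delays, slacks, alpha=0.5)

# Delay of "b" grows after a move but stays under its ceiling of 3.0.
print(timing_cost({"a": 2.0, "b": 2.5}, crit, ceiling, k_damp=10.0))
# Delay of "b" grows past its ceiling, so the damping term is applied.
print(timing_cost({"a": 2.0, "b": 4.0}, crit, ceiling, k_damp=10.0))
```

A connection with slack can absorb some delay growth at little cost, while growth beyond its ceiling is penalized sharply through the damping weight.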
Wirelength cost of a placement may be measured by determining a number of routing
wires that cross cut-lines that outline a LAB. Figure 13 illustrates the utilization of cut-lines
according to an embodiment of the present invention. Blocks 1301-1309 represent LABs having
a plurality of shown LEs. Horizontal cut-lines 1311 and 1312 and vertical cut-lines 1313 and
1314 are placed in each horizontal channel of a target device. Cut-lines provide a method to
measure congestion by finding the regions that have the largest number of routing wires
1321-1324. This measurement may be used to prevent the formation of localized congested areas that
can cause circuitous routes. The total number of routing wires that intersect a particular cut may
be calculated by finding all the signals that intersect a particular cut-line and summing the
average crossing-count for each of these signal wires. The average crossing count for a signal
may be computed using the following relationship.

$\mathrm{CrossingCount}(net) = q(\mathrm{NumCLBlockPins}(net))$ (10)

The function q is given as a number of discrete crossing counts as a function of signal pin
count. The argument to the function q is the number of clustered logic block pins used to wire the
signal. With respect to the functions shown in (5)-(10), it should be appreciated that other types
of functions may be used in addition to or in place of the functions represented.
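
The cut-line measurement may be sketched as follows. This is an illustration only: the values in `Q_TABLE` are placeholders rather than values taken from the description, and a signal is assumed to intersect a vertical cut-line when the bounding box of its pins spans the cut.

```python
# Placeholder crossing counts indexed by pin count (not values from the text).
Q_TABLE = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.25, 5: 1.5}


def crossing_count(num_pins):
    """Relationship (10): look up q for the signal's pin count."""
    largest = max(Q_TABLE)
    return Q_TABLE[min(num_pins, largest)]   # clamp large signals to last entry


def cut_line_usage(nets, cut_x):
    """Sum crossing counts of all signals that span a vertical cut at cut_x.

    `nets` is a list of signals, each a list of (x, y) pin locations.
    """
    total = 0.0
    for pins in nets:
        xs = [x for x, _ in pins]
        if min(xs) < cut_x < max(xs):         # bounding box spans the cut
            total += crossing_count(len(pins))
    return total


# Hypothetical signals on a small grid of LABs.
nets = [
    [(0, 0), (2, 0)],                   # 2 pins, crosses the cut
    [(0, 1), (1, 1), (2, 2), (2, 0)],   # 4 pins, crosses the cut
    [(2, 1), (3, 1)],                   # 2 pins, entirely right of the cut
]
print(cut_line_usage(nets, cut_x=1.5))
```

Evaluating the same function at every cut-line and comparing the totals identifies the most congested channel.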

Referring back to Figure 10, at 1003, it is determined whether the cost associated with
any of the proposed moves is better than the cost associated with the current placement. The
costs associated with the proposed moves and current placement may be obtained by using cost
function values generated from using the cost function described with respect to 1002. If it is
determined that the cost associated with any of the proposed moves is better than the cost
associated with the current placement, control proceeds to 1004. If it is determined that the cost
associated with any of the proposed moves is not better than the cost associated with the current
placement, control proceeds to 1005.

At 1004, the proposed move associated with the best cost is selected as the current 
placement. 

At 1005, it is determined whether any additional LABs in the system have architectural
violations. If additional LABs in the system have architectural violations, control moves to
one of these LABs and proceeds to 1001. If no additional LABs in the system have architectural
violations, control proceeds to 1006 and terminates the procedure. According to an embodiment
of the present invention, a counter may be used to track the number of proposed moves that have
been generated, or the number of LEs or LABs that have had proposed moves generated. In this
embodiment, when this number exceeds a threshold value, instead of proceeding to 1001, control
terminates the procedure and returns an indication that a fit was not found.
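
The flow of Figure 10 may be sketched as a greedy loop, as shown below. This is a simplified illustration under stated assumptions: a placement is reduced to a tuple of LE counts per LAB in a single row, a move shifts one LE to an adjacent LAB, the cost is the total overflow, and the counter tracks the number of LABs visited. All names and the capacity value are hypothetical.

```python
CAPACITY = 2  # hypothetical number of LEs allowed per LAB


def violating_labs(placement):
    return [i for i, n in enumerate(placement) if n > CAPACITY]


def propose_moves(placement, lab):
    """Candidate placements that move one LE from `lab` to a neighbor."""
    moves = []
    for nb in (lab - 1, lab + 1):
        if 0 <= nb < len(placement):
            candidate = list(placement)
            candidate[lab] -= 1
            candidate[nb] += 1
            moves.append(tuple(candidate))
    return moves


def cost(placement):
    return sum(max(n - CAPACITY, 0) for n in placement)


def legalize(placement, max_visits=1000):
    """Return (placement, True) on success or (placement, False) if no fit."""
    visits = 0
    while True:
        labs = violating_labs(placement)              # 1005
        if not labs:
            return placement, True                    # 1006
        if visits >= max_visits:
            return placement, False                   # threshold exceeded
        lab = labs[visits % len(labs)]
        visits += 1
        candidates = propose_moves(placement, lab)    # 1001
        if not candidates:
            continue
        best = min(candidates, key=cost)              # 1002
        if cost(best) < cost(placement):              # 1003
            placement = best                          # 1004


print(legalize((3, 1, 2)))
print(legalize((2, 3, 2), max_visits=20))
```

The first call resolves the violation with a single move. The second call has no improving move available, so it stops at the visit threshold and reports that no fit was found.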

Figure 14 is a flow chart illustrating a method for performing incremental placement
utilizing directed hill-climbing according to an embodiment of the present invention. The method
described in Figure 14 may be used to perform incremental placement as shown at 105 in Figure
1. At 1400, a loop iteration index, L, is set to 1.

At 1401, proposed moves for all LEs in a LAB having architectural violations are
generated. According to an embodiment of the present invention, the proposed moves may be
generated in a manner similar to that described at 1001 shown in Figure 10. The number of LEs
having proposed moves generated is recorded.

At 1402, a current placement of LEs in a LAB with architectural violations and proposed
moves of the LEs in the LAB are evaluated by a cost function. According to an embodiment of
the present invention, the evaluation may be conducted in a manner similar to that described at
1002 of Figure 10.

At 1403, it is determined whether the cost associated with any of the proposed moves is
better than the cost associated with the current placement. The costs associated with the proposed
moves and current placement may be obtained by using values generated from using the cost
function described with respect to 1002. If the cost associated with any of the proposed moves is
better than the cost associated with the current placement, control proceeds to 1404. If the cost
associated with any of the proposed moves is not better than the cost associated with the current
placement, control proceeds to 1405.

At 1404, the proposed move associated with the best cost is selected as the current 
placement. 

At 1405, it is determined whether any additional LABs in the system have architectural
violations. If additional LABs in the system have architectural violations, control moves to
one of these LABs and proceeds to 1407. If no additional LABs in the system have architectural
violations, control proceeds to 1406 and terminates the procedure.

At 1407, it is determined whether the number of LEs that have proposed moves generated
exceeds the value K, where K is a predefined value. If the number of LEs that have proposed
moves generated exceeds the value K, control proceeds to 1409. If the number of LEs that have
proposed moves generated does not exceed the value K, control proceeds to 1408.

At 1408, the loop iteration index, L, is incremented. Control returns to 1401.

At 1409, timing analysis is performed. According to an embodiment of the present
invention, the values for maxdelay and crit(c), used for evaluating timing cost, are updated to
reflect the current configuration of the system.

At 1410, the cost function is updated. According to an embodiment of the present
invention, weighting coefficients in the ClusterCost parameter are incremented in proportion to
an amount of violation. Updating the cost function allows directed hill-climbing to be performed.
Directed hill-climbing is a technique that is used for generating proposed moves when moves
cannot be found to decrease the current cost of a placement.

Figure 15 illustrates an example where directed hill-climbing may be applied. The target
device 1500 includes a plurality of LABs 1501-1505 each having a plurality of shown LEs. In
this example, LAB 1503 has one LE more than is allowed by its architectural specification.
Every possible move that attempts to resolve the architectural constraints of the center LAB 1503
results in another architectural violation. If all architectural violations are costed in the same
manner, then the method described in Figure 10 may have difficulties resolving the constraint
violation.

Figure 16 illustrates a two dimensional slice of the multi-dimensional cost function
described. The current state 1601 represents the situation shown in Figure 15. No single move in
the neighborhood of the current state finds a solution with a lower cost. However, the cost
function itself could be modified to allow for the current state 1601 to climb the hill. The
weighting coefficients of the cost function may be gradually increased for LABs that have
unsatisfied constraints. A higher weight may be assigned to unsatisfied constraints that have been
violated over a long period of time or over many iterations. This results in the cost function being
reshaped to allow for hill climbing. The reshaping of the cost function has the effect of filling the
basin in which the current state is trapped at a local minimum. Referring back to Figure 15, once
the weighting coefficients have been increased for LAB 1503, a proposed move to one of the
adjacent clusters may be made to allow for shifting the violation "outwards" to a free space.

Updating a cost function also allows for a quick convergence by preventing a
phenomenon known as thrashing. Thrashing occurs when incremental placement is trapped in an
endless cycle where an LE is moved between two points in the configuration space which both
result in architectural violations. By increasing the cost or penalty for moving to the two points, a
move to a third point would eventually be more desirable and accepted.

Referring back to Figure 14, at 1411, it is determined whether the loop index, L, is
greater than a threshold value. If the loop index, L, is not greater than the threshold value, control
proceeds to 1408. If the loop index, L, is greater than the threshold value, control proceeds to
1412.

At 1412, control terminates the procedure and returns an indication that a fit was not
found.
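
The effect of updating the weighting coefficients may be sketched with a simplified one-dimensional model, shown below. This is an illustration under stated assumptions and not the full flow of Figure 14: timing analysis and the value K are omitted, a placement is a tuple of LE counts per LAB in a single row, a move shifts one LE to an adjacent LAB, and the weight of each violating LAB is raised whenever no improving move exists. All names and values are hypothetical.

```python
def directed_hill_climb(placement, capacity, max_iters=100, step=1.0):
    """Return (placement, True) on success or (placement, False) if no fit."""
    weights = [1.0] * len(placement)      # one legality weight per LAB

    def overflow(p, i):
        return max(p[i] - capacity, 0)

    def cost(p):
        return sum(w * overflow(p, i) for i, w in enumerate(weights))

    def neighbors(p, lab):
        out = []
        for nb in (lab - 1, lab + 1):
            if 0 <= nb < len(p):
                q = list(p)
                q[lab] -= 1
                q[nb] += 1
                out.append(tuple(q))
        return out

    for _ in range(max_iters):            # loop index with a threshold
        bad = [i for i in range(len(placement)) if overflow(placement, i)]
        if not bad:
            return placement, True
        best = min(neighbors(placement, bad[0]), key=cost, default=placement)
        if cost(best) < cost(placement):
            placement = best              # accept the improving move
        else:
            # Reshape the cost function: raise the weights of violating
            # LABs in proportion to the amount of violation.
            for i in bad:
                weights[i] += step * overflow(placement, i)
    return placement, False


# The center LAB holds one LE too many; free space exists only at the ends.
print(directed_hill_climb((1, 2, 3, 2, 1), capacity=2))
```

With equal weights, every move out of the center LAB only relocates the violation at the same cost, so a purely greedy search stalls. After one weight update the violation is cheaper in the adjacent LAB, and from there a move into the free space at the end removes it.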

Referring back to Figure 1, at 106, it is determined whether additional restructuring needs 
to be performed. According to an embodiment of the present invention, it is determined whether 
additional critical signals exist that have not been processed. If additional critical signals exist, 
control returns to 104 to expand components of the system used for processing the next most
critical signal among the remaining critical signals. If no additional critical signals exist, control
proceeds to 107. 

At 107, routing of the system is performed. During routing, routing resources on the 
target device are allocated to provide interconnections between logic gates, logic elements, and 
other components on the target device. The routing procedure may be performed by a router in
an EDA tool that utilizes routing algorithms. 

The incremental placement techniques disclosed allow logic changes to be incorporated
into an existing system design without reworking placement of the entire system. The
incremental placement techniques attempt to minimize disruption to the original placement and
maintain the original timing characteristics. According to an embodiment of the present
invention, a method for designing a system on a target device utilizing FPGAs is disclosed. The
method includes placing new LEs at preferred locations on a layout of an existing system.
Illegalities in placement of the components are resolved. According to one embodiment,
resolving the illegalities in placement may be achieved by generating proposed moves for an LE,
generating cost function values for a current placement of the LE and for placements associated
with the proposed moves, and accepting a proposed move if its associated cost function value is
better than the cost function value for the current placement.

Figures 1, 4, 10, and 14 are flow charts illustrating a method for designing a system on a
PLD, a method for performing layout-driven optimization, and methods for performing 
incremental placement. Some of the techniques illustrated in these figures may be performed 
sequentially, in parallel or in an order other than that which is described. It should be appreciated 
that not all of the techniques described are required to be performed, that additional techniques 
may be added, and that some of the illustrated techniques may be substituted with other 
techniques. 

Embodiments of the present invention (e.g. exemplary process described with respect to 
Figures 1, 4, and 5) may be provided as a computer program product, or software, that may 
include a machine-readable medium having stored thereon instructions. The machine-readable 
medium may be used to program a computer system or other electronic device. The machine- 
readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, 
and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, flash 
memory, or other types of media/machine-readable medium suitable for storing electronic
instructions. 

In the foregoing specification the invention has been described with reference to specific 
exemplary embodiments thereof. It will, however, be evident that various modifications and 
changes may be made thereto without departing from the broader spirit and scope of the 
invention. The specification and drawings are, accordingly, to be regarded in an illustrative 
rather than restrictive sense.